Abstract

Background: With the recent advances and availability of various high-throughput sequencing technologies, data 
on many molecular aspects, such as gene regulation, chromatin dynamics, and the three-dimensional organization 
of DNA, are rapidly being generated in an increasing number of laboratories. The variation in biological context, 
and the increasingly dispersed mode of data generation, imply a need for precise, interoperable and flexible 
representations of genomic features through formats that are easy to parse. A host of alternative formats are 
currently available and in use, complicating analysis and tool development. The issue of whether and how the 
multitude of formats reflects varying underlying characteristics of data has to our knowledge not previously been 
systematically treated. 

Results: We here identify intrinsic distinctions between genomic features, and argue that the distinctions imply 
that a certain variation in the representation of features as genomic tracks is warranted. Four core informational 
properties of tracks are discussed: gaps, lengths, values and interconnections. From this we delineate fifteen 
generic track types. Based on the track type distinctions, we characterize major existing representational formats 
and find that the track types are not adequately supported by any single format. We also find, in contrast to the 
XML formats, that none of the existing tabular formats are conveniently extendable to support all track types. We 
thus propose two unified formats for track data, an improved XML format, BioXSD 1.1, and a new tabular format, 
GTrack 1.0.

Conclusions: The defined track types are shown to capture relevant distinctions between genomic annotation 
tracks, resulting in varying representational needs and analysis possibilities. The proposed formats, GTrack 1.0 and 
BioXSD 1.1, cater to the identified track distinctions and emphasize preciseness, flexibility and parsing convenience. 



Background

Recent ChIP and high-throughput sequencing technologies are currently generating functional
annotations at unprecedented speed and resolution. The availability of detailed protein binding
locations, DNA methylation, histone modifications, DNA variations of individuals, and more for
different tissues and conditions, provides the basis for a plethora of representational formats of
genome-wide data. Adding to this, new technologies for assessing the three-dimensional structure of
the DNA, such as Hi-C [1], introduce the concept of distance measures between different parts of a
genome, opening up a whole new set of representational complexity.

Several attempts have been made at defining general formats for the textual representation of
genome annotation data. One such format is the General Feature Format (GFF), currently in version 3
[2]. Other generic formats are provided in connection to the UCSC Genome Browser [3]: the Browser
Extensible Data format (BED), bedGraph and WIG, among others. One reason for the different formats
is that different properties are required, often in order to support information related to specific
domains, technologies or experimental methods. Consider for instance the BED15 format by UCSC. This
is an extension of the BED format, adding 3 columns in order to represent microarray expression data
[4]. Other examples are the Gene Transfer Format (GTF) [5] for gene tracks and the Genome Variation
Format (GVF) [6].

Another reason behind the proliferation of formats
seems to be an issue of practicality. Certain types of 
genome annotations, or genomic tracks, are more efficiently
and elegantly represented by certain data formats.
Consider a track of DNA melting temperatures, i.e. an
algorithmic prediction of the denaturation temperature
for each base pair of the genome, e.g. [7].
Representing such a track in the Wiggle format (WIG) 
would take around 20 GB for the human genome. The 
exact same information could be represented in the bed- 
Graph format, but the file size would then expand to 
around 100 GB. In this case, the file would contain 
much redundant information, such as repeated chromo- 
some declarations, and start and end positions that are 
always increased by one for each line. The help pages at 
the UCSC Genome Browser explicitly recommend the 
WIG format for "dense, continuous data" and bedGraph 
for "continuous data that is sparse or contains elements 
of varying size" [8]. From this it seems that, at an
abstract level, there may exist fundamental distinctions
between track data that warrant the use of particular
textual formats. We are, however, not aware of
any systematic discussion of such distinctions in the 
literature. 
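
The redundancy argument above can be illustrated with a minimal sketch. It serializes the same per-base values once as a WIG fixedStep block and once as bedGraph lines, and compares the sizes. The function names, the chromosome name, the start position and the synthetic temperature values are all assumptions made for illustration; the resulting ratio depends on coordinate and value widths and is not meant to reproduce the 20 GB and 100 GB figures quoted above.

```python
def fixed_step_wig(chrom, start, values):
    """Serialize per-base values as one WIG fixedStep block (1-based start)."""
    header = f"fixedStep chrom={chrom} start={start} step=1"
    return "\n".join([header] + [f"{v:.1f}" for v in values]) + "\n"


def bedgraph(chrom, start, values):
    """Serialize the same values as bedGraph lines (0-based, half-open)."""
    lines = []
    for offset, value in enumerate(values):
        s = start - 1 + offset  # convert the 1-based position to a 0-based start
        lines.append(f"{chrom}\t{s}\t{s + 1}\t{value:.1f}")
    return "\n".join(lines) + "\n"


# Synthetic "melting temperatures" for 100,000 consecutive base pairs (made up).
values = [70.0 + (i % 50) / 10 for i in range(100_000)]

wig_text = fixed_step_wig("chr1", 1_000_001, values)
bg_text = bedgraph("chr1", 1_000_001, values)
ratio = len(bg_text) / len(wig_text)

print(f"WIG: {len(wig_text)} bytes, bedGraph: {len(bg_text)} bytes, ratio: {ratio:.1f}")
```

The bedGraph text repeats the chromosome name and both coordinates on every line, which is exactly the redundant information discussed in the paragraph above.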

Expanding on this notion of systematic distinctions 
between track data, it seems that such distinctions also 
warrant differences in which analyses are applicable. It 
is for instance meaningful to ask whether SNPs fall 
inside exons, but it is not meaningful to ask whether 
SNPs fall inside melting temperature. Conversely, one 
can ask whether SNP locations have high melting tem- 
peratures, but not whether SNPs have high exons. This 
indicates that there may be some form of abstract gram- 
mar, where each track defines a set of informational 
properties, and each analysis only makes sense on cer- 
tain sets of informational properties for the tracks in 
question. 

In this paper, we start with a clarification of basic 
nomenclature. We then discuss how the presence of dif- 
ferent core informational properties of a track can be 
used to delineate fifteen different types of tracks at an 
abstract level. The fifteen track types encompass most
existing data formats, and in addition open up for data
sets that make use of cross-positional linking, e.g. data sets
based on the three-dimensional structure of DNA. We
continue by reviewing common, generic formats, in tab- 
ular, XML-based, or binary form, and discuss how they 
fit with the proposed track types. This is followed up 
with the proposal of a new tabular format and an 
updated XML format for track data. These formats 
build closely on previous ones, but obey the distinctions 
between types of tracks. Finally, we discuss supporting 
tools for the proposed formats, including a code base 
supporting the storage of tracks in an efficient binary
format, illustrating how the formats can be pragmatically
applied in high-speed analyses.

Results and Discussion 

Definitions 

A reference genome may be abstracted as a line-based 
coordinate system. To build on this powerful metaphor, 
we use the term genomic track (or, in short, track, as 
used by the UCSC Genome Browser [3]) to refer to a 
series of data units positioned on such a line. The basic 
informational unit is called a track element, that is, a 
unit of data with associated genomic coordinates that 
may or may not be explicitly specified. A track element 
is to be thought of as a mathematical or implementa- 
tional abstraction, in tabular formats typically repre- 
sented as a single data line. Although the concept of 
genomic tracks is most useful for describing data that 
refer to a single reference genome, the meaning carries 
easily over to datasets referring to multiple reference 
genomes, or to contigs or scaffolds of partially 
assembled genomes. 

We further define a genome feature as a track element 
or set of track elements comprising a biological unit, e.g. 
a specific gene, of a certain feature type, e.g. genes. The 
term biological unit is to be understood broadly and 
should also include experimental results, algorithmic 
predictions and similar concepts, such as defined under 
sequence feature in the Sequence Ontology [9]. Note 
that a feature, e.g. a gene, may be composed of several
track elements, e.g. representing the exons of that gene.
Often, a complete genome annotation, i.e. features of
many feature types connected to a genome, is collected
into a single file. This complicates the comparison of 
different feature types, creating the need for filtering 
such a file for the appropriate feature types prior to ana- 
lysis. On the other hand, restricting a track to contain 
only a single feature type may reduce the information. 
For example, the connection between genes and their 
exons is lost if the two feature types are stored as separate
tracks. We thus define a genomic track more specifically
as a set of track elements of one or several feature
types, defined over an appropriate genome-scale coordi- 
nate system, where the set of feature types constitutes a 
pragmatic unit for analysis. A genomic track is then, in 
our view, defined in relation to an analytical purpose, 
whether explicitly defined or only suggested. This is in
contrast to a data file used mainly for storage, which
should be considered more as a flat file database.

Core informational properties of tracks 

A genomic track consists of a set of track elements and, 
for each element, describes a set of properties, such as 
an identifier, a quality score or the method used. The 
positional information of a track element is obligatory 
for any genomic track and can be interpreted generically
across tracks. The position of a track element is often 
encoded as a pair of start and end coordinates. How- 
ever, when looking at genomic tracks from the perspec- 
tive of information content, we find it fruitful to identify 
the positional information equivalently as the lengths of 
the track elements and the gaps between them, both 
measured in base pairs. As the positional information is 
essential and generic, we refer to gaps and lengths as 
core informational properties of the track. 
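
The equivalence between start/end pairs and gaps/lengths can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are 1-based with inclusive ends, elements are sorted and non-overlapping, and the first gap is measured from the start of the sequence. The function names are hypothetical.

```python
def to_gaps_and_lengths(elements):
    """Convert sorted, non-overlapping (start, end) pairs to gaps and lengths.

    Assumes 1-based coordinates with inclusive ends. The first gap is the
    number of base pairs before the first element.
    """
    gaps, lengths = [], []
    previous_end = 0
    for start, end in elements:
        gaps.append(start - previous_end - 1)
        lengths.append(end - start + 1)
        previous_end = end
    return gaps, lengths


def to_positions(gaps, lengths):
    """Inverse conversion: rebuild (start, end) pairs from gaps and lengths."""
    positions = []
    previous_end = 0
    for gap, length in zip(gaps, lengths):
        start = previous_end + gap + 1
        end = start + length - 1
        positions.append((start, end))
        previous_end = end
    return positions


elements = [(5, 10), (11, 11), (20, 29)]
gaps, lengths = to_gaps_and_lengths(elements)
print(gaps, lengths)
print(to_positions(gaps, lengths) == elements)
```

Because the conversion round-trips, no positional information is lost by describing a track in terms of gaps and lengths instead of coordinates.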

A genomic track may also carry a main value asso- 
ciated with each track element, for instance the mea- 
sured expression of a gene or the copy number of a 
genomic region. We thus include values among the core 
informational properties. This main value can be a number
(e.g. the expression of a gene), a binary value (e.g. if
the element is considered case or control), a category
(e.g. the feature type), a character (e.g. the allele variant of
a SNP), or a list of values (e.g. gene expression for a set
of patients).

Lastly, a track element may be connected to other 
track elements located at different locations on the gen- 
ome. This is critical for three-dimensional tracks, as 
locations that seem far apart when the DNA is 
unwound, could still be co-located in the nucleus. The 
corresponding core informational property of a track is 
then interconnections. The interconnections, or edges, 
are either directed or undirected, possibly with an 
attached weight value. 

Fifteen genomic track types 

The four core informational properties (gaps, lengths,
values, and interconnections) will not always all be defined
for a track. Consider, for instance, a track of viral inser- 
tion points on a genome. As it makes no sense to talk 
about the length of an insertion point, such a track will 
not have the lengths property defined. Similarly, a track 
of single nucleotide polymorphisms (SNPs) will only 
contain elements that refer to single discrete positions 
on the genome. The track elements will, however, have 
associated values denoting the respective alleles. Con- 
sider also the DNA melting map, a track where a tem- 
perature value is assigned to every base pair of the 
genome [7]. As temperature values, i.e. track elements, 
are defined for every consecutive position of the genome,
there are never any gaps between the elements.
Also, the elements refer to single base pairs and have no 
lengths. Thus, a track of DNA melting will have neither 
the lengths nor the gaps property defined, only the 
values property (denoting temperature). 

Four core properties, each being defined or not, give $2^4 = 16$
distinct combinations. Assuming that a genomic
track always consists of track elements with the same
core properties, we can distinguish tracks on the basis
of which combination of core properties is defined.
For one of the sixteen combinations, no core properties 
are defined. It is thus of no interest, hence reducing the 
set to fifteen combinations. 

Looking closely at the fifteen combinations, an inter- 
esting pattern appears. Figure 1 shows an illustration of 
the informational contents of each combination. As 
every combination denotes a particular geometric con- 
figuration, strikingly distinct from the others, we refer to 
tracks of the different combinations as having different 
track types. The concept of dividing genomic tracks into 
track types was partially introduced in [10], but has now 
been expanded from five to fifteen track types. 

Looking at the top left of Figure 1 and going down- 
ward, we start at the base case where the only core 
informational property is the gaps between the track ele- 
ments. In this case, each track element represents an 
exact base pair location on the genome, denoting e.g. 
viral insertion sites. We call this track type Points (P). 
Adding informative values to this case, e.g. associating 
SNPs with allele frequencies, we get the track type 
Valued Points (VP). In the next two cases, the lengths 
property is added, resulting in the track types Segments 
(S) and Valued Segments (VS). Segments are probably 
the most common track type of existing tracks, repre- 
senting common features such as genes or exons. 
Valued segments could, for instance, denote genes with 
associated expression levels. 

Moving on, we remove the values and gaps properties, 
leaving only lengths. Such tracks consist of segments 
covering all base pairs of the genome, i.e. a partition of 
the genome into potentially unequal pieces. Hence, the 
track type is called Genome Partition (GP). Basic exam- 
ples of this track type are the partition of a genome into 
chromosomes or cytobands. Adding a value to each part 
of a partition creates a Step Function (SF), covering the 
whole genome with values. Basic examples of such 
tracks are tracks denoting results of tiling microarrays, 
provided that any gaps or overlaps between the tiles are
ignored. Removing the lengths core property, the step 
function track is transformed into a track of type Func- 
tion (F), where every base pair has an associated value. 
Examples of function tracks are tracks with close depen- 
dency on the genome sequence, such as GC content 
tracks, or predictions of melting temperatures, as outlined
above. We call the seven track types outlined here
the basic track types.

The fourth core informational property, interconnec- 
tions, can be envisioned as an orthogonal extension to 
the previous discussion. Adding interconnections, or 
edges, to the seven track types previously outlined (first 
column in Figure 1) defines linked versions of the same 
track types, e.g. Linked Segments (LS) or Linked Step 
Function (LSF) (second column of Figure 1). Although
tracks that include interconnections are presently in
little use, enough datasets exist to warrant the definition
of all the linked track types, at least for completeness.
For example, the recent Hi-C dataset of Dekker
et al. [4] partitions the genome into 1 Mbp regions
(for the genome-wide case), where each pair of regions
has an associated proximity value. This dataset is then
of type Linked Genome Partition (LGP), where every
region has a weighted edge to all other regions. More
traditionally, one could envision a gene/protein pathway
being represented as gene segments, perhaps also
with associated expression data, linked together
with directed edges representing associations (binding,
activation, inhibition, etc.). This would be of type
Linked Valued Segments (LVS). Note that a track type
is considered linked if at least some track elements are
interconnected.

[Figure 1, two columns of panels. Basic track types: Points (P), Valued Points (VP), Segments (S),
Valued Segments (VS), Genome Partition (GP), Step Function (SF), Function (F). Extended track types:
Linked Points (LP), Linked Valued Points (LVP), Linked Segments (LS), Linked Valued Segments (LVS),
Linked Genome Partition (LGP), Linked Step Function (LSF), Linked Function (LF), Linked Base Pairs (LBP).]

Figure 1 Illustration of the geometric properties of the fifteen track types. The base line is a genome, or a sequence, on which the tracks
are defined. Vertical lines represent positions, while horizontal lines represent the lengths of the track elements. Gaps are thus illustrated by any
empty areas between the elements. Values are represented by the height of the vertical lines. Interconnections are represented by arrows, the
thickness of which corresponds to the weight of the edge.

To complete the picture, a last track type needs to be
defined. If only the interconnections core property is
defined, track elements do not have gaps between them,
lengths, or values. All base pairs are then track elements,
with each base pair connected to other base pairs by
edges, hence the name Linked Base Pairs (LBP). Thinking
in terms of graphs, all base pairs will thus be nodes,
although not all nodes need to have any edges. This is in
contrast to the track type linked points, which limits the
nodes to a specified set of points. The track type Linked
Function (LF) is similar to linked base pairs, only adding 
an associated value to each base pair (node). The linked 
base pairs track type is mostly suggestive at this point, 
but at least theoretically, this would be the track type of 
the perfect three-dimensional track, mapping the dis- 
tance between all base pairs of a genome. Another 
example of a track of this kind is the representation of a 
randomization of a genome, with each edge representing 
the positional relocation of a base pair. We refer to the 
eight linked track types as the extended track types. Fig- 
ure 2 shows an overview of the relations between the fif- 
teen track types and the combination of core 
informational properties defined. 




Figure 2 Four-dimensional matrix mapping the relations of the fifteen track types. Each dimension represents the exclusion (0) or 
inclusion (1) of one of the four core informational properties: gaps, lengths, values and interconnections. The track type abbreviations in the 
top-left box are: Genome Partition (GP), Points (P) and Segments (S); in the bottom-left box: Function (F), Step Function (SF), Valued Points (VP) and 
Valued Segments (VS); in the top-right box: Linked Base Pairs (LBP), Linked Genome Partition (LGP), Linked Points (LP) and Linked Segments (LS); and 
in the bottom-right box: Linked Function (LF), Linked Step Function (LSF), Linked Valued Points (LVP) and Linked Valued Segments (LVS). The track 
types with white background (with gaps) are the sparse track types, while the ones with grey background (without gaps) are the dense track 
types. See Figure 1 for a geometric illustration of the track types. 
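
The mapping from combinations of core informational properties to track types, as described in the text and in Figure 2, can be summarized in a short sketch. The names `BASIC_TRACK_TYPES` and `track_type` are hypothetical and only serve the illustration; the mapping itself follows the definitions given above.

```python
from itertools import product

# Keys are (gaps, lengths, values) flags; the names follow the text above.
BASIC_TRACK_TYPES = {
    (True, False, False): "Points",
    (True, False, True): "Valued Points",
    (True, True, False): "Segments",
    (True, True, True): "Valued Segments",
    (False, True, False): "Genome Partition",
    (False, True, True): "Step Function",
    (False, False, True): "Function",
}


def track_type(gaps, lengths, values, interconnections):
    """Return the track type name for a combination of core properties."""
    key = (gaps, lengths, values)
    if not any(key):
        if interconnections:
            return "Linked Base Pairs"
        raise ValueError("no core informational property is defined")
    name = BASIC_TRACK_TYPES[key]
    return f"Linked {name}" if interconnections else name


# Enumerate all 2**4 combinations and skip the one with no property defined.
all_types = [
    track_type(*combination)
    for combination in product([False, True], repeat=4)
    if any(combination)
]
print(len(all_types))
for name in all_types:
    print(name)
```

Treating interconnections as an independent flag mirrors the description of the linked track types as an orthogonal extension of the seven basic ones.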
Formal model of genomic tracks

Formally, we base the discussion of track types on a spe- 
cific mathematical model of genomic tracks. We treat 
the genomic coordinates as forming a discrete metric 
space on the natural numbers, defined by the discrete 
metric d: 

$$d(a, b) = |a - b| + 1, \quad a, b \in \mathbb{N} \qquad (1)$$

The genomic coordinates in the model are thus iso- 
lated points. A segment or interval starting at a position 
a and ending at b is defined as the subset S of natural 
numbers where: 

$$S(a, b) = \{\, s \in \mathbb{N} \mid a \le s \le b \,\land\, b > a \,\} \qquad (2)$$

The length of a segment is defined by the metric d,
and is equal to the number of elements in the set. The
length of the segment $S(1, 3) = \{1, 2, 3\}$ is thus
$d(1, 3) = |1 - 3| + 1 = 3 = |S(1, 3)|$. Transferred to the biological
domain, the length of a segment is the number of base 
pairs covered by the segment. The end position of a seg- 
ment must be larger than the start position. We thus 
exclude segments of length 1 from the model, as such 
segments would be exactly equal to a point, e.g. the set 
of a single number: 

$$P(a) = \{\, p \in \mathbb{N} \mid p = a \,\} \qquad (3)$$

From the set notation it follows that a point P can be
precisely defined as falling inside a segment S if and
only if $P \subseteq S$. Two segments, on the other hand, may
partially overlap. A function is precisely defined as a
mathematical function from genomic coordinates to
corresponding values, e.g. $f: \mathbb{N} \to \mathbb{R}$. A step function is
similarly a function from disjoint intervals covering the
entire domain to corresponding values.
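
The set-based definitions above translate directly into a small sketch using Python sets. The function names are assumptions chosen for illustration; the definitions follow equations (1) to (3) and the inclusion criterion stated in the text.

```python
def d(a, b):
    """Distance between two genomic coordinates, as in equation (1)."""
    return abs(a - b) + 1


def segment(a, b):
    """The set of coordinates from a to b inclusive, as in equation (2)."""
    if b <= a:
        raise ValueError("the end position must be larger than the start position")
    return frozenset(range(a, b + 1))


def point(a):
    """The set holding a single coordinate, as in equation (3)."""
    return frozenset({a})


def falls_inside(p, s):
    """A point falls inside a segment if and only if it is a subset of it."""
    return p <= s


def overlaps(s1, s2):
    """Two segments overlap if they share at least one coordinate."""
    return bool(s1 & s2)


s = segment(1, 3)
print(d(1, 3) == len(s))                   # length equals the number of elements
print(falls_inside(point(2), s))           # point 2 lies inside S(1, 3)
print(overlaps(s, segment(3, 6)))          # partial overlap at coordinate 3
```

Materializing every coordinate is only practical for small examples; the sketch is meant to mirror the model, not to suggest an efficient representation.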

Analysis dependency on track types 

As each of the fifteen track types implies a set of core 
informational properties, a track type also poses a limit 
to which analyses are appropriate for a track. It makes 
sense to calculate the base pair coverage of a track of 
genes (type: segments), but not for a track of SNPs (type: 
valued points), which should instead be counted. This
logic also carries over to analyses applied to more than
one track. Consider, for the sake of simplicity, only five
of the fifteen track types. If we select two tracks, each of
one of these five types, we get 15 combinations, provided
that the order of the tracks is not important. Each
of these combinations could then define a set of appropriate
analyses. Table 1 provides analysis examples for
many of the pairwise combinations of the five track
types points, segments, function, valued points, and
valued segments. Although assigned to a single combination
of track types, an analysis may often be meaningful
for a set of such combinations. For instance, asking
whether the points of one track are located inside the
borders of the segments of another track (points vs segments)
is trivially also meaningful where one or
both of the tracks have associated values (e.g. valued
points vs valued segments). Also, it could be meaningful
to ask whether small segments of one track are located
inside the borders of the segments of another track (e.g.
for the segments vs segments combination). The correspondence
between the track types and possible analyses
is at the core of the idea of track types. Although storing
data sets as efficiently as possible is an important
aspect, the bioinformatics field is currently lagging more
in terms of general understanding and standards for
analyzing data sets in meaningful ways. It is our hope
that the definition of track types will help in this regard.

Table 1 Relation between analyses and track types

|                 | Points                 | Segments        | Function                    | Valued Points          | Valued Segments                               |
|-----------------|------------------------|-----------------|-----------------------------|------------------------|-----------------------------------------------|
| Points          | Different frequencies? | Located inside? | Higher values at locations? |                        | Located in highly valued segments?            |
| Segments        |                        | Overlap?        | Higher values inside?       |                        |                                               |
| Function        |                        |                 | Correlated?                 |                        |                                               |
| Valued Points   |                        |                 |                             | Nearby values similar? | Categories differentially located in targets? |
| Valued Segments |                        |                 |                             |                        |                                               |

Examples of analyses for different combinations of track types (using only five of the fifteen defined track types). Note that many of these analyses are valid for
several (though not all) combinations, and are assigned to what we consider the most typical combination for the analysis. All these analyses are carefully
described significance tests [10], available online at the Genomic HyperBrowser [28].

Existing representational formats

Existing formats for representing genomic tracks can
broadly be divided into three groups: textual formats,
binary formats, and XML formats. Often textual and
binary formats are closely connected, such as the SAM
and BAM formats for read alignments [11]. This duality
is due to the different advantages of the two forms. Textual
formats are often human-readable and simpler to
parse and manipulate than their binary alternatives. The
binary formats, on the other hand, are more compact
and more efficient to use, often incorporating indexing
schemes for fast random access to data. XML formats
aim to bridge this gap by defining data structures that
can exist in both textual and binary forms. Note that we
limit the discussion to formats that aim at being general,
in one form or another, thus excluding formats that are
special to a particular technology or platform.

The large majority of formats for genomic data are 
textual, and the large majority of the textual data for- 
mats are tabular, that is, they consist of tab-separated 
columns. Three of the most common tabular formats 
are Generic Feature Format (GFF) [2], Browser Extensi- 
ble Data format (BED) [4] and Wiggle Track Format 
(WIG) [8]. Figure 3 shows an overview of these three 
tabular formats, with example files. 

A main reason for the popularity of tabular formats is 
that they are inherently simple to create and read, both 
manually and by computers. This has been a major 
asset in the field of bioinformatics because of the wide- 
spread use of both ad hoc scripting and WYSIWYG 



editing in spreadsheet software (such as Microsoft 
Excel). Still, the abundance of different formats, together 
with the increased complexity of particular formats, cre- 
ates practical problems when e.g. creating new tools. 

XML formats represent a way of letting go of the 
entire process of custom and explicit parsing of files. In 
particular when an XML format is specified by a dedi- 
cated XML Schema (abbreviated XSD, from XML 
Schema Definition), the data included in an XML docu- 
ment can be automatically transformed into convenient 
runtime data objects. XML formats are much used in 
connection with Web services, XML databases, or serial- 
izations of object models, but there have so far been 
only a few XML formats used for exchanging sequence-feature
data. The Distributed Annotation System [12]
uses the DASGFF XML format, which is similar to the
tabular GFF. Web services for feature prediction at CBS
[13] have been using a common XSD-based output format
that has been inspired by GFF. Numerous Web services
and databases define their own XML formats for
annotation data, such as the UniProt XML [14] or the
ELMdb Web service [15]. BioXSD version 1.0 has
defined a format for sequence features that is expressive
enough to be able to substitute the majority of other
feature formats [16]. The main disadvantages of using
XML for genome-scale annotations have been the verbosity
of the textual serialization of XML data and the
large memory usage of most of the libraries parsing
XML. The recent W3C standard for highly optimized
binary representation of XML, the Efficient XML Interchange
(EXI) format [17], promises to solve these
problems.

(A) General Feature Format - GFF

• Begins with a set of meta-information header lines
• Divided into 9 fixed columns (including start and end)
• The last column may include any number of row-specific attributes, i.e. tag=value pairs
• Currently in version 3

    ##gff-version 3
    ##sequence-region ctg123 1 1497228
    ctg123  .  gene  1000  9000  .  +  .  ID=gene00001;Name=EDEN
    ctg123  .  mRNA  1050  7000  .  +  .  ID=mRNA00001;Name=EDEN.1;Parent=gene00001
    ctg123  .  mRNA  1300  7000  .  +  .  ID=mRNA00002;Name=EDEN.2;Parent=gene00001
    ctg123  .  exon  1300  1500  .  +  .  ID=exon00001;Parent=mRNA00002
    ctg123  .  exon  3000  3902  .  +  .  ID=exon00002;Parent=mRNA00001,mRNA00002
    ctg123  .  exon  5000  7000  .  +  .  ID=exon00003;Parent=mRNA00001,mRNA00002
    ctg123  .  CDS   1201  1500  .  +  0  ID=cds00001;Name=edenprotein.1;Parent=mRNA00001
    ctg123  .  CDS   3000  3902  .  +  0  ID=cds00001;Name=edenprotein.1;Parent=mRNA00001
    ctg123  .  CDS   5000  5500  .  +  0  ID=cds00001;Name=edenprotein.1;Parent=mRNA00001
    ctg123  .  CDS   3301  3902  .  +  0  ID=cds00003;Name=edenprotein.2;Parent=mRNA00002
    ctg123  .  CDS   5000  5500  .  +  1  ID=cds00003;Name=edenprotein.2;Parent=mRNA00002
    ctg123  .  CDS   3391  3902  .  +  0  ID=cds00004;Name=edenprotein.3;Parent=mRNA00002
    ctg123  .  CDS   5000  5500  .  +  1  ID=cds00004;Name=edenprotein.3;Parent=mRNA00002

(B) Browser Extensible Data - BED

• Divided into 12 fixed columns, 3 of which are required (including start and end)
• Contains positional information of blocks (subsegments), e.g. exons of genes

    chr22  1000  5000  cloneA  960  +  1000  5000  0  2  567,488,  0,3512
    chr22  2000  6000  cloneB  900  -  2000  6000  0  2  433,399,  0,3601

(C) Wiggle Track Format - WIG

• Each block of data starts with a declaration line. Two types: variableStep for irregular and
  fixedStep for regular positioning of track elements
• Span attribute defines the length of each track element
• Step defines the distance between starts of each element (fixedStep only)
• Two columns for variableStep (start and value), one for fixedStep (value)

    variableStep chrom=chr1
    201 25.0
    301 26.0
    351 25.0
    476 23.0
    variableStep chrom=chr2
    151 10.0
    251 11.0
    376 13.0
    501 14.0

    fixedStep chrom=chr1 start=201 step=100 span=100
    25.0
    26.0
    25.0
    23.0
    fixedStep chrom=chr2 start=151 step=100 span=100
    10.0
    11.0
    13.0
    14.0

Figure 3 Overview of three common tabular formats. A) Generic Feature Format (GFF). The example file is a reduced version of the main
example of the GFF version 3 specification [2]. B) Browser Extensible Data format (BED). The example file is fetched from the specification of the
format at UCSC [4]. C) Wiggle Track Format (WIG) [8]. The example files show the two subformats variableStep and fixedStep. The track elements
in the variableStep file cover single base pairs (span = 1, as default) and contain sparse data. For the fixedStep file, the step attribute is equal to
the span attribute. The fixedStep file thus contains dense data. Figure 4 shows GTrack conversions of these example files.
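
The WIG layout summarized in Figure 3 (declaration lines followed by one- or two-column data lines) can be sketched as a small parser. This is a minimal illustration, not a complete implementation of the format: the function name `parse_wig` is hypothetical, output coordinates are assumed to be 1-based with inclusive ends, and `track` and `browser` lines are simply skipped.

```python
def parse_wig(lines):
    """Yield (chrom, start, end, value) tuples from WIG text lines.

    Handles the variableStep and fixedStep declarations described in
    Figure 3. Coordinates are 1-based and the end is inclusive.
    """
    mode = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("variableStep", "fixedStep"):
            # Declaration line: remember the parameters for the data lines below.
            mode = fields[0]
            params = dict(field.split("=", 1) for field in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))  # span defaults to 1
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params["step"])
            continue
        if mode is None:
            raise ValueError("data line found before any declaration line")
        if mode == "variableStep":
            start, value = int(fields[0]), float(fields[1])
        else:
            start, value = position, float(fields[0])
            position += step
        yield chrom, start, start + span - 1, value


example = [
    "variableStep chrom=chr1",
    "201 25.0",
    "301 26.0",
    "fixedStep chrom=chr2 start=151 step=100 span=100",
    "10.0",
    "11.0",
]
elements = list(parse_wig(example))
for element in elements:
    print(element)
```

With span equal to step, as in the fixedStep example of Figure 3, consecutive elements are adjacent and the block describes dense data; the variableStep lines yield single-base elements with gaps between them.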



Binary formats are often used internally in software 
systems, and not necessarily provided as public formats. 
Some exceptions to this are the aforementioned BAM, 
as well as the bigBed and bigWig formats [18]. The last 
two formats are binary versions of the BED and WIG 
format, respectively, providing efficient storage and 
indexing capabilities, allowing users to store large tracks 
on their own computers, while a server requests only 
the parts needed for analysis or visualization. Another 
binary format is the USeq Compressed Binary format 
[19] focusing on tight compression of tabular data files 
of different types, while keeping them in an indexed 
structure. 

As Figure 3 illustrates, different formats support dif- 
ferent combinations of the core informational proper- 
ties, and hence, different track types. Table 2 provides 
an overview of which of the basic track types are 



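Table 2 The track types supported by existing tabular, binary and XML formats

| Format                    | Ref. | Data               | Repr.    | P     | S | VP    | VS | GP | SF | F | L   | Strand | #Cols | Value type              |
|---------------------------|------|--------------------|----------|-------|---|-------|----|----|----|---|-----|--------|-------|-------------------------|
| GFF3/GTF                  | [2]  | General            | Tab.     | ✓ (1) |   |       | ✓  |    |    |   | (2) | ✓      | 9     | Float (3)               |
| BED/bigBed                | [4]  | General            | Tab/Bin. | ✓ (1) | ✓ | ✓ (1) | ✓  |    |    |   | (2) | ✓      | 3-12  | Int (0-1000)/string (4) |
| BED15                     | [4]  | Microarray         | Tab.     |       |   |       | ✓  |    |    |   | (2) | ✓      | 15    | List of floats (5)      |
| bedGraph                  | [4]  | General            | Tab.     |       |   |       | ✓  |    |    |   |     |        | 4     | Float                   |
| WIG/bigWig (fixedStep)    | [8]  | General            | Tab/Bin. |       |   | ✓     | ✓  |    | ✓  | ✓ |     |        | 1     | Float                   |
| WIG/bigWig (variableStep) | [8]  | General            | Tab/Bin. |       |   | ✓     | ✓  |    |    |   |     |        | 2     | Float                   |
| CNT                       | [36] | Copy number        | Tab.     |       |   | ✓     |    |    |    |   |     |        | 4     | Float                   |
| Personal Genome SNP       | [4]  | Variation          | Tab.     |       |   |       | ✓  |    |    |   |     |        | 7     | String (6)              |
| VCF                       | [37] | Variation          | Tab.     |       |   |       | ✓  |    |    |   |     |        | > 8   | String (6, 3)           |
| GVF                       | [6]  | General/Variation  | Tab.     |       | ✓ |       | ✓  |    |    |   | (2) | ✓      | 9     | Float (3)               |
| PSL                       | [4]  | Alignment          | Tab.     |       | ✓ |       | ✓  |    |    |   |     | ✓      | 21    | Int (7)                 |
| SAM/BAM                   | [38] | Alignment          | Tab/Bin. |       | ✓ |       | ✓  |    |    |   |     | ✓      | 11    | Int/string (8)          |
| BioHDF                    | [39] | Alignment          | Bin.     |       | ✓ |       | ✓  |    |    |   |     | ✓      | 11    | Int/string (8)          |
| MAF                       | [4]  | Multiple Alignment | Tab.     |       | ✓ |       | ✓  |    |    |   | (9) | ✓      | 2-7   | Float/string (8)        |
| FASTA                     | [40] | Sequence           | Text     |       |   |       |    |    |    | ✓ |     |        | N/A   | Char                    |
| DAS XML                   | [12] | General            | XML      | ✓ (1) | ✓ | ✓ (1) | ✓  |    |    |   | (2) | ✓      | N/A   | Float                   |
| BioXSD 1.0                | [16] | General            | XML      |       |   |       |    |    |    |   |     | ✓      | N/A   | Float (12)              |
| USeq                      | [19] | General            | Bin.     | ✓     | ✓ | ✓     | ✓  |    |    |   |     | ✓      | N/A   | Int/float/string        |
| Genomedata                | [41] | General            | Bin.     |       |   | ✓     | ✓  |    | ✓  | ✓ |     |        | N/A   | Int/float/char          |

The track type abbreviations are as follows: Points (P), Segments (S), Valued Points (VP), Valued Segments (VS), Genome Partition (GP), Step Function (SF), and
Function (F). L refers to any of the linked track types. The table also denotes whether the format supports specification of strand, the number of columns of the
tabular formats, and the type of the dominant value, if any. Numbers in parentheses refer to the footnotes below.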

1 Points are specified using both start and end values. There is no way of specifying that a file contains only points. 

2 Only a special case of linked segments is supported, namely part-of relationships, such as an exon being a part of a gene.

3 The chosen value type refers to what may be considered the main score column of the format. The format also includes a configurable column containing 
values that may be extracted by specialized parsers. 

4 We limit the bigBed format to the standard BED columns for simplicity, as the bigBed format is highly customizable through the use of AutoSQL configurations. 

5 The float values represent a set of gene expression values from microarray experiments. 

6 The values represent the possible alleles at a SNP position. Also, the allele frequencies and quality scores are reported and could be used as values. 

7 E.g. the number of bases that match/do not match. 

8 E.g. the mapping quality or the aligned sequence itself. 

9 Links to alignments in other genomes. 

10 There is no way of specifying that a record contains only points or only segments. 

11 No weights are supported in BioXSD 1.0. 

12 Numerical values are always signed double precision floats (8 bytes). A limited set of other value types is also allowed (e.g. sequence variation and
alignments).
covered by some common formats. As each of the different
groups of formats (tabular, XML, and binary) has
advantages in distinct scenarios and communities, one
would ideally like to select three formats that cover all
track types, one from each group. Unfortunately, no
common formats do. One option would be to extend an
existing format to support all track types. A main reason
for such an extension would be to be able to make use
of the plethora of tools and parsers already available. In
the case of XML formats, the existing BioXSD 1.0 format
was found to be easily extensible to support all
track types. In the case of tabular formats, however, the
only major format to support extensions is GFF,
through the attribute column. However, using GFF to
represent e.g. tracks of type function would be highly
impractical. Each base pair would then be represented
by a data line of nine columns, wasting considerable
amounts of space. The remaining option is then to create
a new tabular format. In order for the introduction
of a new format to be justified, such a format should
have the potential to replace at least some of the existing
formats, in addition to having the extensibility
required to meet future needs when new types of data
appear. As binary formats are often not independent
formats, but typically linked to tabular ones, we will not
focus on such formats here. We thus present a pair of
general formats aware of all track types, one of which is
tabular and the other based on XML. The tabular format,
GTrack 1.0, is a new format that builds closely on
the BED and WIG formats, while adding support for
extensions in a similar fashion as in GFF. The XML format
is a successor of the existing BioXSD 1.0 format.
Besides catering to a broader user base, presenting
"track type"-compliant formats of both kinds illustrates
that the fundamental concepts of track type are independent
of implementation. The primary goals for the
formats are to support all track types systematically, to
allow custom extensions, and to provide efficient storage,
while at the same time focusing on simple parsing
and manipulation of files.

GTrack: Type-aware tabular format 

We here introduce a new tabular track format: the
GTrack format, short for both "Genomic Track" and
"Generic Track". The GTrack format supports all fifteen
previously defined track types, illustrated in Figures 1
and 2. A GTrack file includes a column specification
line, specifying the names of all the columns in the file.
Each track type has a one-to-one correspondence to a
combination of core columns being present in the column
specification line, as detailed in Table 3. The four
core informational properties are represented by the
four core reserved columns in such a way that the existence
of each core column (start, end, value, and edges)
corresponds to a core property being defined (gaps,
lengths, values, and interconnections, respectively):

♦ Gaps are implicitly represented by the start column,
i.e. it holds the start coordinate of a track element
and thus marks the end of any preceding gap.

♦ For sparse track types, i.e. track types with gaps,
length is implicitly represented by the difference
between start and end columns. For dense track
types (without gaps), there is no start column. The
length is then the difference between the previous
end position and the current. Deriving length from
the end position, rather than the start position, is
preferable, as a parser in the opposite case would
have to read the subsequent line before concluding
on the length of the current track element. The existence
of the end column thus corresponds directly to
the track elements having the length property.

♦ Although several columns in a data set may contain
values of potential interest, one of these columns
will typically provide a main value used in
processing or analysis according to a given purpose.
This focus is specified by the value column.

♦ The edges column contains, for each track element,
a comma-separated list of id's of other track elements
which are interconnected with the element in
question, in addition to values associated to the
edges, e.g. weights or edge types.

♦ A GTrack file may contain several columns containing
values or edges. Users may then switch
between them by simply editing the column specification
line.
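
The correspondence between core columns and track types described in the list above can be sketched in a few lines of Python. This is only an illustration, not part of the GTrack tooling: the function name `track_type` and the lookup table are hypothetical, and the sketch assumes that column names in the column specification line are separated by tab characters.

```python
# Hypothetical helper for illustration; not part of any GTrack library.
CORE = ("start", "end", "value", "edges")

# Base track type for each combination of (start, end, value) columns.
BASE_TYPES = {
    (True, False, False): "points",
    (True, True, False): "segments",
    (False, True, False): "genome partition",
    (True, False, True): "valued points",
    (True, True, True): "valued segments",
    (False, True, True): "step function",
    (False, False, True): "function",
}


def track_type(column_spec_line):
    """Return the track type implied by a '###' column specification line."""
    if not column_spec_line.startswith("###"):
        raise ValueError("not a column specification line")
    # Assumption: column names are tab-separated.
    columns = column_spec_line[3:].strip().split("\t")
    has = {name: name in columns for name in CORE}
    if has["edges"] and "id" not in columns:
        raise ValueError("the edges column requires the id column")
    key = (has["start"], has["end"], has["value"])
    if key == (False, False, False):
        if not has["edges"]:
            raise ValueError("no core column present")
        # Dense, no lengths, no values: only interconnections remain.
        return "linked base pairs"
    base = BASE_TYPES[key]
    return "linked " + base if has["edges"] else base
```

Non-core and custom columns (seqid, strand, or any user-defined name) are simply ignored by the lookup, which mirrors the statement that only the four core columns define the track type.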

The edges column requires that the non-core reserved
column id is present, containing a unique identifier for
each track element. Three other non-core columns are
specified in the GTrack format: genome, seqid and
strand (see Table 3). The titles of the eight reserved columns
are reserved words in the column specification
line. They may appear in any order, and any number of
additional columns may be specified. Figure 4 shows six
example GTrack files, five of which are conversions of
the example files in Figure 3. The example files illustrate
the variation stemming from the different column specification
lines (starting with the characters '###').

When creating the GTrack format, we have emphasized
simplicity, both for creation, manual reading and
automated parsing of the format. We have identified
three principles towards simplicity: independence of
data lines, overview of structural characteristics and
equally sized lines.

The principle of independent data lines states that it
should be possible to interpret each data line in a tabular
format independently of its location in the file. This
Table 3 Overview of the reserved columns in the GTrack format and their associations to track type

Core property:                | -      | -     | Gaps  | Lengths | Values | -      | -  | Interc.
GTrack column(s):             | genome | seqid | start | end 1   | value  | strand | id | edges 2
Type of column:               | N      | N     | C     | C       | C      | N      | N  | C
------------------------------+--------+-------+-------+---------+--------+--------+----+--------
Points (P)                    | ?      | !     | ✓     | .       | .      | ?      | ?  | .
Segments (S)                  | ?      | !     | ✓     | ✓       | .      | ?      | ?  | .
Genome Partition (GP)         | ?      | !     | .     | ✓       | .      | ?      | ?  | .
Valued Points (VP)            | ?      | !     | ✓     | .       | ✓      | ?      | ?  | .
Valued Segments (VS)          | ?      | !     | ✓     | ✓       | ✓      | ?      | ?  | .
Step Function (SF)            | ?      | !     | .     | ✓       | ✓      | ?      | ?  | .
Function (F)                  | ?      | !     | .     | .       | ✓      | ?      | ?  | .
Linked Points (LP)            | ?      | !     | ✓     | .       | .      | ?      | ✓  | ✓
Linked Segments (LS)          | ?      | !     | ✓     | ✓       | .      | ?      | ✓  | ✓
Linked Genome Partition (LGP) | ?      | !     | .     | ✓       | .      | ?      | ✓  | ✓
Linked Valued Points (LVP)    | ?      | !     | ✓     | .       | ✓      | ?      | ✓  | ✓
Linked Valued Segments (LVS)  | ?      | !     | ✓     | ✓       | ✓      | ?      | ✓  | ✓
Linked Step Function (LSF)    | ?      | !     | .     | ✓       | ✓      | ?      | ✓  | ✓
Linked Function (LF)          | ?      | !     | .     | .       | ✓      | ?      | ✓  | ✓
Linked Base Pairs (LBP)       | ?      | !     | .     | .       | .      | ?      | ✓  | ✓

C Core reserved column (defines track type)
N Non-core reserved column (reserved, but does not define track type)
✓ Column is mandatory
? Column is optional
. Column is not allowed
! Property must be present, either as a column or in a bounding region specification

1 The length is the difference between the end and the start position, or, if the start column is not present, the difference between the current end position and
the previous.

2 The non-core reserved column id is required when the edges column is present.



is a principle followed in many common formats, e.g.
GFF [2] or BED [4]. Following this principle gives several
advantages. First, when creating or manipulating a
file, keeping data lines independent allows the filtering
and sorting of data lines while still keeping all the relevant
information. Second, keeping a track element on a
single line makes it easier to read for the human eye.
Third, independent data lines reduce the need for automatic
parsers to hold state information. The GTrack
format follows the principle of independent data lines
with two exceptions. First, data lines of dense track
types are dependent on their positions in the file. Second,
the GTrack format allows (and, in the case of
dense track types, requires) the specification of bounding
regions around each block of values. A bounding
region specification line defines the domain of the following
track elements, i.e. the region where we have
information about the features modeled by the track elements.
It is recommended that tracks mask out regions
of a genome where nothing is known (such as centromeres
or assembly gaps) using bounding regions, rather
than just omitting track elements or specifying 0-values,
as the difference is important for many analyses. Bounding
regions unfortunately require parsers to store state
information. See Figure 4A, 4C1, 4C2 and 4D for examples
of bounding region specification lines (starting with
the characters '####').
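
To make the state dependence of dense track types concrete, the following Python sketch expands a step function track into explicit segments. It is an illustration under stated assumptions rather than a reference parser: the function name `read_step_function` is hypothetical, the column specification line is assumed to be exactly `###end<TAB>value`, and bounding region attributes are assumed to be written as in the Figure 4 examples (`####seqid=chr1; start=201; end=600`).

```python
def read_step_function(lines, end_inclusive=False):
    """Yield (seqid, start, end, value) for a GTrack step function track.

    Hypothetical sketch: assumes data lines hold 'end<TAB>value' and that
    every block of data lines is preceded by a bounding region line.
    """
    seqid = None
    next_start = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("####"):
            # Bounding region specification: resets the parser state.
            region = dict(
                part.strip().split("=", 1) for part in line[4:].split(";")
            )
            seqid = region["seqid"]
            next_start = int(region["start"])
        elif line.startswith("#") or not line.strip():
            continue  # header, column specification, comment or blank line
        else:
            if seqid is None:
                raise ValueError("data line before any bounding region")
            end_text, value_text = line.split("\t")
            end = int(end_text)
            yield seqid, next_start, end, float(value_text)
            # The next element starts where this one stops.
            next_start = end + 1 if end_inclusive else end
```

The two variables `seqid` and `next_start` are exactly the state that a parser of independent data lines would not need: each element's start is derived from the previous end position or from the enclosing bounding region.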

The principle of including an overview of structural
characteristics means that a track file should start with a
set of configurable options that describe the structure of
the data lines, in an easily readable manner. Note that
many of these characteristics will, by nature, include
redundant information, i.e. information that could have been
collected from the data lines themselves. There are several
reasons for explicitly stating such characteristics. First, it
gives the human reader a simple overview of the type of
data stored in the file, without having to scrutinize the
actual data. Second, it allows the creator of a track to
validate that the file is structured in the way intended
(for this purpose, we also provide a web-based validator
tool [20]). Third, inclusion of structural characteristics
allows parsers to be restrictive on which kind of structures
to support. A quick script can then, for instance,
read the header and check whether the track type is segments
with no overlapping elements, failing explicitly if
the header does not match this requirement. The script
can then assume that the remaining file follows the
asserted structure, safely ignoring the non-relevant
(A) GFF as GTrack (ad hoc variant)

##gtrack version: 1.0
##track type: linked valued segments
##value type: category
##uninterrupted data lines: true
##1-indexed: true
##end inclusive: true
###source	value	start	end	score	strand	phase	id	name	edges
####seqid=ctg123; start=1; end=1497228
.	gene	1000	9000	.	+	.	gene00001	EDEN	.
.	mRNA	1050	7000	.	+	.	mRNA00001	EDEN.1	gene00001
.	mRNA	1300	7000	.	+	.	mRNA00002	EDEN.2	gene00001
.	exon	1300	1500	.	+	.	exon00001	.	mRNA00002
.	exon	3000	3902	.	+	.	exon00002	.	mRNA00001;mRNA00002
.	exon	5000	7000	.	+	.	exon00003	.	mRNA00001;mRNA00002
.	CDS	1201	1500	.	+	0	cds00001.1	edenprotein.1	mRNA00001
.	CDS	3000	3902	.	+	0	cds00001.2	edenprotein.1	mRNA00001
.	CDS	5000	5500	.	+	0	cds00001.3	edenprotein.1	mRNA00001
.	CDS	3301	3902	.	+	0	cds00003.1	edenprotein.2	mRNA00002
.	CDS	5000	5500	.	+	1	cds00003.2	edenprotein.2	mRNA00002
.	CDS	3391	3902	.	+	0	cds00004.1	edenprotein.3	mRNA00002
.	CDS	5000	5500	.	+	1	cds00004.2	edenprotein.3	mRNA00002


(B1) BED as GTrack (direct variant)

##gtrack version: 1.0
##track type: valued segments
##uninterrupted data lines: true
###seqid	start	end	id	value	strand	thickStart	thickEnd	itemRgb	blockCount	blockSizes	blockStarts
chr22	1000	5000	cloneA	960	+	1000	5000	0	2	567,488,	0,3512
chr22	2000	6000	cloneB	900	-	2000	6000	0	2	433,399,	0,3601


(B2) BED as GTrack (linked variant) 



##gtrack version: 1.0 

##track type: linked valued segments 

##uninterrupted data lines : true 



###seqid 


start 


end 


id 




value 


strand 


thickStart 


thickEnd 


itemR 


lb edges 


chr22 


1000 


5000 


cloneA 




960 


+ 


1000 


5000 


0 




chr22 


1000 


1567 


cloneA 


1 


960 


+ 








cloneA 


chr22 


4512 


5000 


cloneA 


2 


960 


+ 








cloneA 


chr22 


2000 


6000 


cloneB 




900 




2000 


6000 


0 




chr22 


2000 


2433 


cloneB 


1 


900 










cloneB 


chr22 


5601 


6000 


cloneB 


2 


900 










cloneB 



(C1 and C2) WIG as GTrack

##gtrack version: 1.0
##track type: valued points
##1-indexed: true
##end inclusive: true
###start	value
####seqid=chr1
201	25.0
301	26.0
351	25.0
476	23.0
####seqid=chr2
151	10.0
251	11.0
376	13.0
501	14.0

##gtrack version: 1.0
##track type: step function
##1-indexed: true
##end inclusive: true
###end	value
####seqid=chr1; start=201; end=600
300	25.0
400	26.0
500	25.0
600	23.0
####seqid=chr2; start=151; end=550
250	10.0
350	11.0
450	13.0
550	14.0


(D) Linked genome partition in GTrack

##gtrack version: 1.0
##track type: linked genome partition
##edge weights: true
##undirected edges: true
# original column names
#end	id	directed	undirected
###end	id	directed	edges
####seqid=chr1; start=0; end=300
100	0	2=1.4	2=1.4
200	1	.	.
300	2	5=2.3	0=1.4;5=2.3
####seqid=chr2; start=150; end=600
300	3	4=1.2	4=1.2;5=1.7
450	4	.	3=1.2
600	5	3=1.7	2=2.3;3=1.7


Figure 4 GTrack example files. A) GTrack version of the GFF file in Figure 3A. GTrack conversions of GFF vary according to the set of attributes 
present in the GFF file. The column selected as the main value may also be changed. B1 and B2) Two possible GTrack conversions of the BED 
file in Figure 3B. In the direct variant (B1) only a "track type" header line and a column specification line are added. The exon positioning will in 
this case not be understood by a general GTrack parser. The linked variant (B2) expands the exons into subsegments that link to their parent
gene segment. C1 and C2) GTrack conversions of the WIG files in Figure 3C. The variableStep file has sparse track elements covering single base 
pairs, with associated values. The track is thus of type valued points. The fixedStep file contains dense data, with the same values for a series of 
consecutive base pairs. The track type is thus of type step function. Note that in the last example, the end values are used for positioning. D) 
Example GTrack file of type linked genome partition. Here two graphs are defined, one directed and one undirected. To change the active graph, 
the edges column in the column specification line needs to be changed, in addition to the "undirected edges" header line. The example GTrack 
files are available at [20]. BioXSD 1.1 versions of the examples are available as follows: A [21], B1 & B2 [22], C1 [23], C2 [24], and D [25]. 
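
As a small illustration of how the edges fields in Figure 4 might be read, the sketch below parses one field into a mapping from target id to weight. The function name `parse_edges` is hypothetical. The sketch follows the separators visible in the example files (`;` between edges, `=` between target id and weight) and treats a single period as "no edges"; the GTrack specification (Additional file 1) remains the authority on the exact syntax.

```python
def parse_edges(field, weighted=False):
    """Parse an edges field such as '4=1.2;5=1.7' or 'mRNA00001;mRNA00002'.

    Hypothetical helper: returns a dict mapping target id to weight
    (None when the track carries no edge weights).
    """
    if field == ".":
        return {}  # a period marks a column without information
    edges = {}
    for item in field.split(";"):
        if weighted:
            target, _, weight = item.partition("=")
            edges[target] = float(weight)
        else:
            edges[item] = None
    return edges
```

Switching the active graph, as described for Figure 4D, would then amount to passing a different column's field to the same function.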



generality of the GTrack specification. In the GTrack 
format, the structural characteristics are specified in 
header lines, starting with the characters '##'. Table 4 
contains an overview of all GTrack header variables. 
Note that header lines are optional when their values 
are equal to the default values. We also provide the 



"Expand GTrack headers" tool, which generates a 
GTrack file with full headers based on a supplied, 
incomplete GTrack file, further simplifying the process 
of generating header lines. 
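
The "quick script" scenario can be sketched as follows. This is a minimal illustration, not the "Expand GTrack headers" tool: the names `DEFAULTS` and `read_headers` are hypothetical, only the basic header variables of Table 4 are included, header names are assumed to be case-insensitive, and header lines are assumed to precede the column specification line.

```python
# Default values of the basic header variables, as listed in Table 4.
DEFAULTS = {
    "gtrack version": "1.0",
    "track type": "segments",
    "value type": "number",
    "value dimension": "scalar",
    "undirected edges": "false",
    "edge weights": "false",
    "edge weight type": "number",
    "edge weight dimension": "scalar",
    "uninterrupted data lines": "false",
    "sorted elements": "false",
    "no overlapping elements": "false",
    "circular elements": "false",
    "1-indexed": "false",
    "end inclusive": "false",
}


def read_headers(lines):
    """Collect '##name: value' header lines, falling back to the defaults."""
    headers = dict(DEFAULTS)
    for line in lines:
        if line.startswith("###"):
            break  # column specification or bounding region: headers are done
        if line.startswith("##"):
            name, _, value = line[2:].partition(":")
            headers[name.strip().lower()] = value.strip()
    return headers


def require_simple_segments(headers):
    """Fail explicitly unless the file declares non-overlapping segments."""
    if headers["track type"] != "segments":
        raise ValueError("unsupported track type: " + headers["track type"])
    if headers["no overlapping elements"] != "true":
        raise ValueError("overlapping elements are not supported")
```

A script built this way can reject unsupported files up front and keep its data line handling short, which is the trade-off the header overview is meant to enable.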

The principle of equally sized lines states that all 
data lines contain the same number of columns, i.e. 
Table 4 Overview of the header variables of the GTrack format

Header variable | Description | Default value
----------------+-------------+--------------
GTrack version | Version of the GTrack specification used | 1.0
Track type | Track type of the GTrack file | segments
Value type | The kind of content accepted in the value column | number
Value dimension | The dimension of the content in the value column | scalar
Undirected edges | Whether all edges are undirected | false
Edge weights | Whether the edges have weights | false
Edge weight type | The kind of content accepted as edge weights | number
Edge weight dimension | The dimension of the edge weights | scalar
Uninterrupted data lines | Whether it is guaranteed that the data lines are not interrupted by bounding region specification lines or comments | false
Sorted elements | Whether it is guaranteed that all bounding regions and track elements come in sorted order | false
No overlapping elements | Whether it is guaranteed that no two track elements overlap | false
Circular elements | Whether any track elements or bounding regions cross the coordinate borders of a circular sequence | false
1-indexed | Whether the coordinates start at 1 (0 if false) | false
End inclusive | Whether the coordinate specified in the end column is included in intervals | false
*Value column | The name of the column to be used as the 'value' column | value
*Edges column | The name of the column to be used as the 'edges' column | edges
*Fixed length | Fixed length of all track elements | 1
*Fixed gap size | Fixed-size gaps between all neighboring track elements | 0
*Fixed-size data lines | Whether each data line has an exact size in terms of number of characters | false
*Data line size | The size of each data line in terms of number of characters | 1
*GTrack subtype | The name of the subtype of the GTrack format specification used for the file | (empty string)
*Subtype version | The version of the GTrack subtype | 1.0
*Subtype URL | URL to a GTrack file used as a specification/model for the GTrack subtype | (empty string)
*Subtype adherence | Regulates the way a GTrack file may override the subtype specification | free

All header variables not specified in a GTrack file retain their default values.

* Defined in the extended part of the GTrack specification. See the GTrack specification (Additional file 1) for more details.



that all attributes have a value. Columns that do not
contain information are marked with a period character.
There are several advantages for this solution compared
to the solution used in the GFF format, where
the last column may contain a list of attributes in the
format tag=value, allowing the attribute list to differ
for each line. First, having equal size columns allows
validation that all data lines are complete, or at least
that the creator of the track has considered all attributes
for all track elements. With a variable size attribute
column, there is no way to check that all
attributes have been considered. Second, parsing attribute
lists as in the GFF format is more cumbersome,
as the parser will not in advance know which attributes
may appear in the file. Third, not having to
repeat attribute names for all lines saves some space.
Fourth, and most importantly, having the same number
of columns in each data line keeps the interface of
the format coherently organized, with attributes as columns
and track elements as rows. As the GTrack
format supports custom columns, it can completely
replace the attribute solution of the GFF format.
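
The completeness check enabled by equally sized lines is simple to express. The Python sketch below is an illustration only; the function name `parse_data_line` is hypothetical, and columns are assumed to be tab-separated.

```python
def parse_data_line(line, columns):
    """Map one data line onto the column names from the specification line.

    Hypothetical sketch: raises ValueError if the line does not have
    exactly one field per declared column.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(
            "expected %d columns, found %d" % (len(columns), len(fields))
        )
    # A single period marks a column without information.
    return {
        name: (None if field == "." else field)
        for name, field in zip(columns, fields)
    }
```

Because the set of attribute names is known from the column specification line before any data line is read, the parser needs no per-line discovery of tags, in contrast to a GFF-style attribute list.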

In addition to simplicity, the GTrack format aims at
being highly extensible and interoperable. First, the
ability to define columns in any order and number
provides ample options for extensibility, in addition to
simplifying conversion. In many cases, converting another
tabular format to GTrack is as simple as adding a column
specification line. Note that basic, three-column
BED files are directly compatible with the GTrack format,
without the need for any modifications. Also, both
0- and 1-based indexing, in addition to the end position
being inclusive or exclusive, are included in the GTrack
specification, further simplifying conversion. Second,
GTrack includes a strategy for making structured extensions
of the format, namely the specification of subtypes.
Four subtype header lines are available (see Table 4),
specifying the name and version of a subtype, the URL
of the subtype specification, and the strictness of adherence
required by the subtype. The idea is that research
communities can define their own tabular formats, making
use of a subset of the GTrack specification. Such
formats could for instance be replacements of existing
formats, or formats that are honed to specific technologies
or tools. The header variable "subtype URL" points
to a GTrack file that can be used as model for the subtype,
and is intended to be read by automatic parsers.
Figure 5A shows an example of such a subtype specification
file, based on the example GTrack file in Figure
4A. Specifying subtype models allows the reduction of a
complete GTrack header down to a minimum of one
line, as shown in Figure 5B. It is our belief that allowing
extensions of the GTrack format via subtypes caters for
a range of future extensions, while ensuring backward
compatibility. Subtypes can be defined in a range of settings,
from project specific, ad hoc solutions, to the specification
of generic formats. Further examples of
GTrack subtypes are described in the GTrack specification
(Additional file 1). A set of standard GTrack subtypes
are available online [20] (including subtypes
corresponding to the example files in Figure 4).
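
Returning to the indexing options mentioned above, the following Python sketch shows how a parser might normalize coordinates from any of the four combinations of the "1-indexed" and "end inclusive" header variables to a single internal convention. The function name and the choice of 0-based, end-exclusive coordinates as the target are assumptions made for illustration.

```python
def to_zero_based_half_open(start, end, one_indexed=False, end_inclusive=False):
    """Convert (start, end) to 0-based, end-exclusive coordinates.

    Hypothetical helper; the two flags mirror the '1-indexed' and
    'end inclusive' header variables.
    """
    offset = 1 if one_indexed else 0
    start0 = start - offset
    # An inclusive end covers one more position than an exclusive one.
    end0 = end - offset + (1 if end_inclusive else 0)
    return start0, end0
```

For example, the GFF-style interval 1000 to 9000 (1-indexed, end inclusive) maps to (999, 9000), while BED-style coordinates (0-indexed, end exclusive) pass through unchanged. The length `end0 - start0` is the same whichever convention the file declares.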

BioXSD 1.1: Enhanced and optimized XML format 

BioXSD has been developed as a universal XML format
for the basic types of bioinformatics data that is particularly
suitable for use with Web services [16]. It
models common types of data for which a specialized
XML Schema (XSD) has not been widely adopted: biomolecular
sequences, alignments, sequence feature
records, and references to ontologies and data resources.
The BioXSD schema defines formats of data but not
formats of particular XML documents, by defining XSD
types but no global XML elements. BioXSD types can
thus be used according to applications' needs in applications'
own XSDs such as those in WSDL files of Web services.



BioXSD 1.0 type AnnotatedSequence can represent
annotations of a biomolecular sequence or genome with
any types of positioned or non-positioned features,
which can be combined in one record. Although the
textual serialization of XML is in general more verbose
than a tabular format, BioXSD 1.0 already
included a number of optimizations compared to traditional
feature formats like GFF or BED, thanks to the
tree-like structure of XML. These have been mainly:

♦ not repeating the reference to a sequence in every 
feature occurrence 

♦ not repeating the type of feature in every feature 
occurrence 

♦ representing multi-segment and multi-point feature 
occurrences in one feature-occurrence element 

The goal of BioXSD version 1.1 has been to further
improve the expressiveness of the BioXSD formats and
at the same time focus on optimizations of the data size.
The successor of BioXSD 1.0 AnnotatedSequence is
BioXSD 1.1 type FeatureRecord. BioXSD 1.1 in general
allows more types of sequence positions, distinguishing
them in the same way as the tabular GTrack format.
Sparse positions are segments, points (actual points or
insertions), and outer positions. Dense positions have
been added: dense points (function) marked up by
<nextPoint/> empty elements; and dense partition or step
function marked by <nextPartition max="..."/> elements
including the border position where each interval ends.
However, in contrast to GTrack, the different types of
positions can still be freely combined within a FeatureRecord.
The representation of all types of sequence
positions has been refactored, simplified, and optimized.
Another crucial set of optimizations allows specification
of the ontologies, databases, and computational


(A)

##gtrack version: 1.0
##gtrack subtype: GFF ad hoc
##subtype version: 1.0
##subtype adherence: redefinable
##track type: linked valued segments
##value type: category
##1-indexed: true
##end inclusive: true
###source	value	start	end	score	strand	phase	id	name	edges

(B)

##subtype URL: http://gtrack.no/gff_ad_hoc.gtrack
####seqid=ctg123; start=1; end=1497228
.	gene	1000	9000	.	+	.	gene00001	EDEN	.
(...)

Figure 5 GTrack subtype example. A) An ad hoc GTrack subtype specification based on the example GTrack file in Figure 4A, which is a
conversion from the GFF file in Figure 3A. This and other GTrack subtypes are available from the GTrack website [20]. B) A minimal GTrack
header, parsable by fully compliant GTrack parsers. Note that the "Expand GTrack headers" tool, available from the GTrack website [20], can be
used to expand headers of GTrack files using subtypes, in order for such files to be used in simpler parsers that do not support the subtype
functionality.
tools of interest in a condensed way for a list of feature
annotations, so that they do not have to be repeated.
Detailed contents of the BioXSD feature record are
listed in Table 5. Examples of data represented in
BioXSD 1.1 format are available at [21-25].

There is one slight difference in how GTrack and
BioXSD deal with the focus of feature records. GTrack
defines one operational focus of a concrete dataset. That
is the reason why it allows specifying only one type of
track locations and only one value column and one
edges column at a time, although other values and edges
may still be "hidden" in out-of-focus columns. BioXSD
on the other hand allows combining features, types of
track positions, values, and interconnections freely without
any operational focus. Thus, if a tool consuming
BioXSD feature data demands it, a particular operational
focus of the data must be supplied by the user.

Compared to other generic sequence-feature formats,
BioXSD allows defining complex, structured meanings
of annotations, as well as complex feature data and
metadata, or relations. This would not be conveniently
possible in a tabular format and takes advantage of
XML. BioXSD types can freely be combined and
included within documents, files, or applications' inputs
and outputs. They can easily be combined with other
XML formats defined in other XSDs, can be extended
just like classes in an object-oriented programming language,
or further restricted using built-in XSD mechanisms.
BioXSD can be validated and parsed by ordinary
XML/XSD-handling frameworks.



Table 5 The allowed content of a BioXSD FeatureRecord 



Notes 



May further contain 



BioXSD 
description of 
feature type 

Name 

Ontology 
concepts 

Synonyms 

Textual note 

References 

More specific type 
of feature 

More generic class 
of feature types 



to database entries, databases, ontology concepts, other feature types 



type of relationship with the referenced object 2 
name and/or concepts, synonyms, database entries 

name and/or concepts 



BioXSD feature 
occurrence 

Position 



Scores (values)

Evidence 

Name 
Note 

Alignments 

Sequence 
variation 

Frame 

CDS phase 

References 



segments, points 3 , positions outside of the actual sequence or feature 
occurrence 4 , dense points* (function) and dense partition* or step
function* 5 

double-precision signed floats (8 bytes), or any well-formatted strings* 



to ontology concepts, database entries, other feature occurrences
(interconnections)



strand, certainty 



unit, index, type of score 2 , note, position, 
provenance metadata 

references to databases, tools, and citations; scores, 
verdict, reliability, provenance metadata 



alignment- and aligned sequence-specific scores, 
gaps, frameshifts, directions, note, provenance 
metadata 

variants, canonical variant, scores, position 



type of relationship with the referenced object 2 ; 
scores of the relationship (weights of edges)*



1 At least one of these two is mandatory. 

2 By any ontology concept, referred to by a concept URI, identifier, or term; or by a custom term if no ontology concept is available. 

3 Points are bases/residues or insertions between them. 

4 For example if annotating the position of a regulatory element of a coding sequence, or relations between genes or protein domains. 

5 Positions can form multi-segment subsequences, multi-point tuples, and can be combined within feature occurrences according to users' needs. The positions
are always 1-based. The feature occurrence may apply to the whole sequence (being a non-positioned sequence property).

* Added in BioXSD version 1.1. 
It has, however, been problematic to use XML formats 
for highly voluminous data such as whole-genome 
annotations. The textual serialization of XML is more 
verbose than a textual tabular format, and even more so 
than a bespoke binary format. Many basic XML-handling 
tools also have high runtime memory demands, which can 
make parsing of huge XML documents infeasible. All these 
problems will hopefully be solved by the recent and 
long-awaited Efficient XML Interchange (EXI) standard by 
the World Wide Web Consortium [17], together with 
its growing family of supporting libraries, and tools for 
streamed XSLT transformations and random-access 
XPath and XQuery queries. EXI defines how any 
XML data or document should be serialized in a standard 
binary format that will be many times smaller, and 
at the same time faster to access, than textual XML. 
There is no need to develop one's own bespoke binary 
encodings and parsers when using EXI, and the data 
can be handled programmatically in a transparent manner, 
with the same look and feel as ordinary XML. 
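
The memory problem described above arises mainly when a parser builds the whole document tree before any element can be processed. As a minimal sketch of the streamed alternative (not of EXI itself), the following uses the incremental parser of the Python standard library. The element and attribute names (`track`, `feature`, `start`, `end`) are hypothetical and are not taken from the BioXSD schema.

```python
import io
import xml.etree.ElementTree as ET


def count_features(stream, tag="feature"):
    """Count elements named `tag` while reading the XML stream incrementally."""
    count = 0
    # iterparse yields each element once its closing tag has been read,
    # so elements can be handled one at a time instead of after a full parse.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the attributes and children of the processed element.
            # The emptied element stays attached to the root, so extremely
            # long documents may additionally need to detach such elements.
            elem.clear()
    return count


# Hypothetical in-memory document with 1000 features.
doc = b"<track>" + b"".join(
    b'<feature start="%d" end="%d"/>' % (i, i + 10) for i in range(1000)
) + b"</track>"

n_features = count_features(io.BytesIO(doc))
```

The same loop works on an open file handle, in which case memory use is governed by the size of individual elements rather than by the size of the whole document.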

Availability of specifications and supporting tools 

The BioXSD 1.1 XML Schema is available at [26]. 
BioXSD data can be validated by all the main XML vali- 
dation tools, and consumed and produced programmati- 
cally by the bulk of the common XML/XSD-handling 
libraries. Further information and documentation are 
available at [27]. 

A complete specification of the GTrack format version 
1.0 is attached as Additional file 1 and is also available 
from the GTrack website [20]. The website also contains 
supporting tools for the GTrack format, connected to 
the Genomic HyperBrowser [10,28]. Table 6 contains an 
overview of all GTrack-related tools available as 
webtools. 

The GTrack format is maintained by Sveinung Gun- 
dersen and the BioXSD format is maintained by Matus 
Kalas. Both formats are licensed under the Creative 
Commons Attribution-NoDerivs 3.0 Unported License 
[29]. 



The Genomic HyperBrowser [10,28] is built on top of 
the Galaxy framework [30,31] and provides a large set 
of statistical investigations tailored for the specific track 
types of supplied tracks. In order for such analyses to be 
efficient, the system uses a binary storage scheme 
internally. In this scheme, the core informational col- 
umns are stored as C vectors directly written to disk. 
The vector files are then accessed using the NumPy 
package [32] for Python [33], allowing very efficient vec- 
tor computations. A linear index of the files is built in 
order to allow random access to the data. This binary 
representation is stored in parallel to the files in their 
original format, and updated automatically as the origi- 
nal files are updated. The implementation is open 
source and available as part of the HyperBrowser code 
base under the GPL license, version 3 [34]. As an alter- 
native, the recently published Tabix tool [35] provides 
fast access to tabular data in compressed form, and 
works with GTrack files of types Points and Segments, 
and their derivatives. 
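
The storage idea described above (one fixed-width binary vector per core column, plus a linear index for random access) can be sketched as follows. This is an illustration under stated assumptions, not the HyperBrowser implementation: it uses the `array` module of the Python standard library as a dependency-free stand-in for NumPy, and the file names, the example coordinates and the layout of the `index` dictionary are all hypothetical.

```python
import os
import tempfile
from array import array

# Hypothetical segment track, sorted by chromosome and then by start.
starts = array("q", [100, 500, 900, 50, 700])  # signed 64-bit integers
ends = array("q", [200, 650, 950, 120, 800])

# Assumed linear index: chromosome -> (first element, number of elements).
index = {"chr1": (0, 3), "chr2": (3, 2)}


def write_vector(path, vec):
    """Write the raw machine values of `vec` to disk, with no header."""
    with open(path, "wb") as fh:
        vec.tofile(fh)


def read_slice(path, typecode, first, count):
    """Read `count` elements starting at element `first`, skipping the rest."""
    vec = array(typecode)
    with open(path, "rb") as fh:
        # Fixed-width items make the byte offset a simple multiplication.
        fh.seek(first * vec.itemsize)
        vec.fromfile(fh, count)
    return vec


workdir = tempfile.mkdtemp()
start_path = os.path.join(workdir, "start.bin")
end_path = os.path.join(workdir, "end.bin")
write_vector(start_path, starts)
write_vector(end_path, ends)

# Random access: load only the chr2 elements of each column.
first, count = index["chr2"]
chr2_starts = read_slice(start_path, "q", first, count)
chr2_ends = read_slice(end_path, "q", first, count)
chr2_lengths = [e - s for s, e in zip(chr2_starts, chr2_ends)]
```

With NumPy, the slice read would typically be replaced by a memory-mapped array, so that computations such as the segment lengths run as vector operations rather than as a Python loop.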

Conclusions 

By systematic analysis of the informational properties of 
genomic tracks, we delineated fifteen distinct types of 
tracks. These track types shed light on the variability of 
track representations, suggesting that the differences 
between formats are due not only to preferences and 
conventions, but also to fundamental differences in the 
information inherent in different tracks. Furthermore, 
discerning the informational properties of a track allows 
the nature of the track to be precisely conveyed, and 
clarifies which analyses are meaningful for a 
given track. 
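
The count of fifteen can be reproduced by enumerating the four core properties (gaps, lengths, values and interconnections). The sketch below assumes that each track type corresponds to exactly one non-empty combination of these properties, and that the type names are those given in the list of abbreviations; the function name `track_type` and its boolean parameters are introduced here for illustration only.

```python
from itertools import product


def track_type(gaps, lengths, values, links):
    """Name the track type for one combination of the four core properties."""
    if not (gaps or lengths or values or links):
        raise ValueError("a track needs at least one core property")
    if gaps and lengths:
        name = "valued segments" if values else "segments"
    elif gaps:
        name = "valued points" if values else "points"
    elif lengths:
        # No gaps: the elements cover the whole genome.
        name = "step function" if values else "genome partition"
    elif values:
        # No gaps and no lengths: one value per base pair.
        name = "function"
    else:
        # Only interconnections remain.
        return "linked base pairs"
    return "linked " + name if links else name


# All 2**4 - 1 = 15 non-empty combinations of the four properties.
all_types = {
    track_type(*combo)
    for combo in product([False, True], repeat=4)
    if any(combo)
}
```

For example, `track_type(True, True, True, True)` gives "linked valued segments", the most general type, while `track_type(False, True, True, False)` gives "step function".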

The identification of core informational properties of 
tracks, as well as a broad survey of various practicalities 
concerning existing formats, created a basis for the spe- 
cification of a new format for genomic data: the GTrack 
format. By allowing precise interpretation, simple par- 
sing, as well as relatively straightforward conversion to 
several existing formats, we believe that the introduction 
of this "yet another format" will actually help streamline 



Table 6 Overview of the webtools available from the GTrack website [20] 

GTrack supporting tools           Description 
Show GTrack specification         Displays an HTML version of the GTrack specification 
Validate GTrack file              Checks whether a GTrack file complies with the specification 
Convert tabular file to GTrack    Converts any tabular file to GTrack 
Convert file to/from GTrack       Converts to and from common tabular formats (GFF, BED, WIG, bedGraph) 
Expand GTrack headers             Expands partially completed GTrack headers based on data contents 
Standardize GTrack file           Converts a GTrack file to track type "linked valued segments" using the default indexing scheme 
Sort GTrack file                  Sorts a GTrack file (including bounding regions) 
Complement GTrack columns         Complements the columns of a GTrack file based on another GTrack file 

All tools are implemented as part of the Genomic HyperBrowser [10,28] and available under the GPL license, version 3 [34]. 
data representation in the field. Finally, coordinating 
the GTrack format with an enhanced and optimized 
version 1.1 of the BioXSD format also aids in unifying 
tabular and XML-based track representation, while 
keeping the specific advantages of the two. 

Additional material 



Additional file 1: GTrack specification. Specification document of 
GTrack 1.0. 
BAM: Binary Alignment/Map format; BED: Browser Extensible Data format; 
ChIP-seq: Chromatin Immunoprecipitation sequencing; EXI: Efficient XML 
Interchange; F: function; GFF: General Feature Format; GTF: Gene Transfer 
Format; GVF: Genome Variation Format; GP: genome partition; P: points; LBP: 
linked base pairs; LF: linked function; LGP: linked genome partition; LP: linked 
points; LS: linked segments; LSF: linked step function; LVP: linked valued 
points; LVS: linked valued segments; S: segments; SAM: Sequence 
Alignment/Map format; SF: step function; SNP: single nucleotide 
polymorphisms; URI: Uniform resource identifier; URL: Uniform resource 
locator; VP: valued points; VS: valued segments; WIG: Wiggle format; WSDL: 
Web Service Definition Language; WYSIWYG: what you see is what you get; 
XML: Extensible Markup Language; XSD: XML Schema Definition. 